CoreTSAR: Task Scheduling for Accelerator-aware Runtimes
Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency, and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, current implementations lose one of OpenMP's best features, its flexibility: typical OpenMP applications can run on any number of CPUs, but GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup on four GPUs over using only CPUs or a single GPU, while increasing the overall flexibility of Accelerated OpenMP.
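To make the flexibility gap concrete, the sketch below hand-partitions one loop across every device OpenMP can see, which is roughly the bookkeeping a CoreTSAR-style runtime automates. The partitioning scheme and names here are illustrative assumptions, not the paper's actual API.

    // Manually splitting one loop across all visible devices: the kind of
    // bookkeeping a CoreTSAR-style runtime hides. Illustrative sketch only.
    #include <omp.h>

    void scale(float* x, int n, float a) {
        int ndev  = omp_get_num_devices();   // accelerators OpenMP can see
        int parts = ndev > 0 ? ndev : 1;
        int chunk = (n + parts - 1) / parts;

        #pragma omp parallel for             // one host thread drives each device
        for (int d = 0; d < parts; ++d) {
            int lo = d * chunk;
            int hi = lo + chunk < n ? lo + chunk : n;
            #pragma omp target teams distribute parallel for \
                    device(d) if(ndev > 0) map(tofrom: x[lo:hi-lo])
            for (int i = lo; i < hi; ++i)
                x[i] *= a;
        }
    }

A static, even split like this also illustrates the problem: it cannot adapt when devices differ in speed, which is exactly what a dynamic scheduler such as CoreTSAR addresses.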
Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures
As core counts increase and heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly efficient concurrent algorithms and data structures can become bottlenecks unless they are designed from the ground up with throughput as their primary goal. In this paper, we present three contributions: (1) a characterization of queue designs in terms of modern multi- and many-core architectures, (2) the design of a high-throughput, linearizable, blocking, concurrent FIFO queue for many-core architectures that avoids the bottlenecks and pitfalls common in modern queue designs, and (3) a thorough evaluation of concurrent queue throughput across CPU, GPU, and co-processor devices. Our evaluation shows that focusing on throughput, rather than progress guarantees, allows our queue to scale as much as three orders of magnitude (1000×) faster than lock-free and combining queues on GPU platforms and two times (2×) faster on CPU devices. These results deliver critical insights into the design of data structures for highly concurrent systems: (1) progress guarantees do not guarantee scalability, and (2) allowing an algorithm to block can increase throughput.
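As a baseline for the design point the abstract argues for, here is a minimal blocking, linearizable FIFO in C++, a sketch using the standard mutex/condition-variable idiom rather than the authors' high-throughput algorithm: consumers block until an item arrives instead of spinning through lock-free retry loops.

    // Minimal blocking, linearizable FIFO queue: illustrates the "blocking
    // can increase throughput" design point, not the paper's algorithm.
    #include <condition_variable>
    #include <mutex>
    #include <queue>

    template <typename T>
    class BlockingQueue {
        std::queue<T>           q_;
        std::mutex              m_;
        std::condition_variable nonempty_;
    public:
        void enqueue(T v) {
            {
                std::lock_guard<std::mutex> lk(m_);
                q_.push(std::move(v));
            }
            nonempty_.notify_one();           // wake one blocked consumer
        }
        T dequeue() {                         // blocks until an item is available
            std::unique_lock<std::mutex> lk(m_);
            nonempty_.wait(lk, [this] { return !q_.empty(); });
            T v = std::move(q_.front());
            q_.pop();
            return v;
        }
    };

The paper's claim is not that a naive single-lock queue like this scales, but that a carefully engineered blocking design can outperform lock-free and combining queues at high thread counts.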
A power-measurement methodology for large-scale, high-performance computing
Improvement in the energy efficiency of supercomputers can be accelerated by improving the quality and comparability of efficiency measurements. The ability to generate accurate measurements at extreme scale is just now emerging. The realization of system-level measurement capabilities can be accelerated with a commonly adopted, high-quality measurement methodology for use while running a workload, typically a benchmark. This paper describes a methodology that has been developed collaboratively through the Energy Efficient HPC Working Group to support architectural analysis and comparative measurements for rankings, such as the Top500 and Green500. To support measurements with varying amounts of effort and equipment required, we present three distinct levels of measurement, which provide increasing levels of accuracy. Level 1 is similar to the Green500 run rules today: a single average power measurement extrapolated from a subset of a machine. Level 2 is more comprehensive, but still widely achievable. Level 3 is the most rigorous of the three methodologies but is only possible at a few sites. However, the Level 3 methodology generates a high-quality result that exposes details that the other methodologies may miss. In addition, we present case studies from the Leibniz Supercomputing Centre (LRZ), Argonne National Laboratory (ANL), and Calcul Québec Université Laval that explore the benefits and difficulties of gathering high-quality, system-level measurements on large-scale machines.
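For intuition about what a Level 1 measurement involves, the sketch below shows the arithmetic of extrapolating a subset measurement to a full system; the function names and the exact averaging rule are assumptions for illustration, not the Working Group's specification.

    // Illustrative Level-1-style arithmetic: average power from a measured
    // subset of nodes, extrapolated to the whole machine. Not the EE HPC WG
    // methodology itself; names and averaging rule are assumptions.
    #include <numeric>
    #include <vector>

    // samples: instantaneous power readings (watts) from the measured subset,
    // taken at fixed intervals across the benchmark's measurement window.
    double full_system_avg_watts(const std::vector<double>& samples,
                                 int measured_nodes, int total_nodes) {
        double subset_avg =
            std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
        return subset_avg * (double(total_nodes) / measured_nodes);  // extrapolate
    }

    // Green500-style efficiency figure: sustained gigaflops per watt.
    double gflops_per_watt(double sustained_gflops, double avg_watts) {
        return sustained_gflops / avg_watts;
    }

Broadly, the higher levels reduce exactly this extrapolation: more of the machine is measured, at finer granularity, recovering the details the abstract notes that Level 1 may miss.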
Beyond Explicit Transfers: Shared and Managed Memory in OpenMP
OpenMP began supporting offloading in version 4.0, almost 10 years ago. It introduced the programming model for offload to GPUs or other accelerators that was common at the time, requiring users to explicitly transfer data between host and devices. But advances in heterogeneous computing and programming systems have created a new environment. No longer are programmers required to track and move their data on their own. Now, for those who want it, inter-device address mapping and other runtime systems push these data management tasks behind a veil of abstraction. In the context of this progress, OpenMP offloading support shows signs of its age. However, because of its ubiquity as a standard for portable, parallel code, OpenMP is well positioned to provide a similar standard for heterogeneous programming. Toward this goal, we review the features available in other programming systems and argue that OpenMP should expand its offloading support to better meet the expectations of modern programmers. The first step, detailed here, augments OpenMP's existing memory space abstraction with device awareness and a concept of shared and managed memory. Thus, users can allocate memory accessible to different combinations of devices without explicit memory transfers. We show the potential performance impact of this feature and discuss the possible downsides.
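A minimal sketch of the contrast, assuming C++ with OpenMP offloading: the first function uses today's explicit map clauses; the second uses the OpenMP 5.x allocator API with a managed memory space of the kind the paper proposes. The symbol ompx_managed_mem_space is hypothetical, standing in for the proposed extension; it is not standard OpenMP.

    #include <omp.h>

    // Today: the runtime copies data because the map clauses say so.
    void saxpy_mapped(float* x, float* y, int n, float a) {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    // Hypothetical managed memory space standing in for the paper's
    // proposal; not part of standard OpenMP.
    extern omp_memspace_handle_t ompx_managed_mem_space;

    // With managed allocation, the same kernel needs no map clauses: x and y
    // are accessible to both host and device.
    void saxpy_managed(int n, float a) {
        omp_allocator_handle_t mgd =
            omp_init_allocator(ompx_managed_mem_space, 0, nullptr);
        float* x = (float*)omp_alloc(n * sizeof(float), mgd);
        float* y = (float*)omp_alloc(n * sizeof(float), mgd);
        // ... initialize x and y on the host ...
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
        omp_free(x, mgd);
        omp_free(y, mgd);
        omp_destroy_allocator(mgd);
    }

One caveat the abstract flags: removing explicit transfers trades programmer control for convenience, so the performance impact can cut either way.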